Explore and Summarize Data: Red Wine Dataset by John Ortiz ========================================================
For this project, I will analyze the red wine dataset. I will try to determine what values of each variable make for the quailty of the wine. First I will perform a univariate, bivariate, and multivariate analysis.
Dataset Background: This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
The dataset is related to a red variant of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are munch more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
# Load the Data
wine <- read.csv('wineQualityReds.csv')
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Let’s do some modifications to the dataset. Quality seems like it would best fit as a ordered factor. Let’s us also add a new ordered factored variable called ‘rating’ based on the quailty of the wine.
#Convert Quality to a Factor
wine$quality <- factor(wine$quality, ordered = T)
#Create 'Rating' as an ordered factor
wine$rating <- ifelse(wine$quality < 5, 'bad', ifelse(
wine$quality < 7, 'average', 'good'))
wine$rating <- ordered(wine$rating,
levels = c('bad', 'average', 'good'))
Let’s do a plot the distribution of each of the variable. Also, there are extreme outliers for many variables, let’s remove them so a better analysis can be performed.
INSIGHT: Most of the wines present in the dataset are average quality wines.
INSIGHT: Fixed Acidity is positively skewed. The median is around 8ish.
INSIGHT: There seems to be a Bimodal distribution for Volatile acidity, with peaks at around 0.375 and 0.625.
INSIGHT:The distribution of Citric acid looks a bit off. The seem to be missing data in the higher values. Maybe the data collected was incomplete?
INSIGHT: Residual Sugar is positively skewed with high peaks at around 2.2. There are a lot of outliers present at the higher values.
INSIGHT: Chlorides is positively skewed with high peaks at around 0.25.
INSIGHT: Free Sulphur Dioxide has a high peak at around 5. It is also positively skewed with a long tail. Some outliers in the high values are present
INSIGHT: Total Sulphur Dioxide has a high peak at around 25. It is also positively skewed with a long tail. Some outliers in the high values are present
INSIGHT: Density seems to have a normal Distribution.
INSIGHT: pH seems to have a normal Distribution.
INSIGHT: Sulphates has long tailed distribution.
INSIGHT: Alcohol has a skewed distribution.
This Dataset has 1599 rows and 14 columns.
Categorical variables are ‘quality’ and ‘rating’. The rest of the variables are numerical variables which detail the chemical and physical properties of the wine.
My main focus in this dataset project is the ‘quality’ of the wine. What is it excalty that drives the quailty of the wine.
investigation into your feature(s) of interest?
After looking at the distributions of the varible I think the acidity of the wine is the biggest factor of it’s quailty. Since pH is related to acidity, I also think it may have some effect on the quality. Residual sugar might also have an effect on wine quality, but from the little I know of wine, sugars should not have much of an effect on quality but on whether how ‘dry’ it is.
I did do some modifications to the dataset. Quality seems like it would best fit as a ordered factor so I converted it. I also added a new ordered factored variable called ‘rating’ based on the quailty of the wine.
Residual sugar and Chloride seems to have extreme outliers present. Density and pH looks to be normally distributed with few outliers. Total and free sulfur dioxides, alcohol and sulphates, and Fixed and volatile acidity seem to be long-tailed. What was unusual was Citric acid. It has large number of zero values! It also showed a unique distribution; a mostly rectangular shape.
For this section I think a correlation matrix should be created to start off. This matrix should give me an idea about which varibles may be correlated.
First let’s remove some columns and convert quality back to a number!
#remove some columns and convert quality back to a number
c <- cor(
wine %>%
dplyr::select(-X) %>%
dplyr::select(-rating) %>%
mutate(
quality = as.numeric(quality)
)
)
#create correlation matrix
emphasize.strong.cells(which(abs(c) > .3 & c != 1, arr.ind = TRUE))
pandoc.table(c)
##
## ---------------------------------------------------------------------------
## fixed.acidity volatile.acidity citric.acid
## -------------------------- --------------- ------------------ -------------
## **fixed.acidity** 1 -0.2561 **0.6717**
##
## **volatile.acidity** -0.2561 1 **-0.5525**
##
## **citric.acid** **0.6717** **-0.5525** 1
##
## **residual.sugar** 0.1148 0.001918 0.1436
##
## **chlorides** 0.09371 0.0613 0.2038
##
## **free.sulfur.dioxide** -0.1538 -0.0105 -0.06098
##
## **total.sulfur.dioxide** -0.1132 0.07647 0.03553
##
## **density** **0.668** 0.02203 **0.3649**
##
## **pH** **-0.683** 0.2349 **-0.5419**
##
## **sulphates** 0.183 -0.261 **0.3128**
##
## **alcohol** -0.06167 -0.2023 0.1099
##
## **quality** 0.1241 **-0.3906** 0.2264
## ---------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## ------------------------------------------------------------------------------
## residual.sugar chlorides free.sulfur.dioxide
## -------------------------- ---------------- ------------ ---------------------
## **fixed.acidity** 0.1148 0.09371 -0.1538
##
## **volatile.acidity** 0.001918 0.0613 -0.0105
##
## **citric.acid** 0.1436 0.2038 -0.06098
##
## **residual.sugar** 1 0.05561 0.187
##
## **chlorides** 0.05561 1 0.005562
##
## **free.sulfur.dioxide** 0.187 0.005562 1
##
## **total.sulfur.dioxide** 0.203 0.0474 **0.6677**
##
## **density** **0.3553** 0.2006 -0.02195
##
## **pH** -0.08565 -0.265 0.07038
##
## **sulphates** 0.005527 **0.3713** 0.05166
##
## **alcohol** 0.04208 -0.2211 -0.06941
##
## **quality** 0.01373 -0.1289 -0.05066
## ------------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------------------------------------
## total.sulfur.dioxide density pH
## -------------------------- ---------------------- ------------- -------------
## **fixed.acidity** -0.1132 **0.668** **-0.683**
##
## **volatile.acidity** 0.07647 0.02203 0.2349
##
## **citric.acid** 0.03553 **0.3649** **-0.5419**
##
## **residual.sugar** 0.203 **0.3553** -0.08565
##
## **chlorides** 0.0474 0.2006 -0.265
##
## **free.sulfur.dioxide** **0.6677** -0.02195 0.07038
##
## **total.sulfur.dioxide** 1 0.07127 -0.06649
##
## **density** 0.07127 1 **-0.3417**
##
## **pH** -0.06649 **-0.3417** 1
##
## **sulphates** 0.04295 0.1485 -0.1966
##
## **alcohol** -0.2057 **-0.4962** 0.2056
##
## **quality** -0.1851 -0.1749 -0.05773
## -----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -------------------------------------------------------------------
## sulphates alcohol quality
## -------------------------- ------------ ------------- -------------
## **fixed.acidity** 0.183 -0.06167 0.1241
##
## **volatile.acidity** -0.261 -0.2023 **-0.3906**
##
## **citric.acid** **0.3128** 0.1099 0.2264
##
## **residual.sugar** 0.005527 0.04208 0.01373
##
## **chlorides** **0.3713** -0.2211 -0.1289
##
## **free.sulfur.dioxide** 0.05166 -0.06941 -0.05066
##
## **total.sulfur.dioxide** 0.04295 -0.2057 -0.1851
##
## **density** 0.1485 **-0.4962** -0.1749
##
## **pH** -0.1966 0.2056 -0.05773
##
## **sulphates** 1 0.09359 0.2514
##
## **alcohol** 0.09359 1 **0.4762**
##
## **quality** 0.2514 **0.4762** 1
## -------------------------------------------------------------------
INSIGHT: Quality is strongly correlated to Alcohol and Volatile acidity.
INSIGHT: Density and Fixed Acidity have a very strong correlation.
INSIGHT: Naturally, Alcohol has negative correlation with density.
INSIGHT: Very unsual! Volatile acidity has a positive correlation with pH… I thought pH and acidity had a negative relationship!
It seems like Box plots between these variables are in order!
INSIGHT: Fixed Acidity has almost no effect on the Quality.
INSIGHT: Volatile acid and quality seem to have a negative coorelation.
INSIGHT: Citric acid and quailty seem to have a positive correlation.
INSIGHT: My previous thought that Residual Sugar may have no effect an effect on the wine quality seems to be true!
INSIGHT: There seems to be a weak negative correalation between chlorides and quality.
INSIGHT: Very low concentrations of Free Sulphur Dioxide produces low quality wine. And very high concentrations produces average wine.
INSIGHT: Very low concentrations of total Sulphur Dioxide produces low quality wine. And very high concentrations produces average wine.
INSIGHT: Lower densities seem to favor higher quailty wines.
INSIGHT: Higher Quailty wines seem to contain less pH
INSIGHT: Why are we seeing that Volatile Acid and pH have a positive correlation? Let’s pose this for further investigation.
INSIGHT: Even though we see many outliers in the ‘Average’ quality wine, it seems that better wines have a stronger concentration of Sulphates.
INSIGHT: Higher quality wines seem to have higher Alcohol content in it. Below are some statitics to help clarify.
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.12503 0.17471 -0.716 0.474
## alcohol 0.36084 0.01668 21.639 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
INSIGHT: The value of R squared seems to suggest that Alcohol contributes to only about 23% of the quality. What about the other 77%? Let continue to dig. Let’s do a correlation test quailty against each variable.
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## log10.residual.sugar log10.chlordies free.sulfur.dioxide
## 0.02353331 -0.17613996 -0.05065606
## total.sulfur.dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## log10.sulphates alcohol
## 0.30864193 0.47616632
INSIGHT: The correlation test shows that the following variables have a high correlation to quality.
investigation. How did the feature(s) of interest vary with other features in the dataset? What was the strongest relationship you found?
Higher concentration of Citric Acid seems to produce higher quality. Higher alcohol percentages seems to produce higher quality. Lower percent of Chloride seems to produce higher quality. Lower densities seems to produce higher quality. The more acidic seems to produce higher quality. Lower Volatile Acidity seems to produce higher quality. Residual sugar seems to have almost no effect on quality. Fixed Acidity seems to have almost no effect on quality.
(not the main feature(s) of interest)?
For some reason volatile acidity has a positive correlation with pH…
The data shows that alcohol plays a strong part in the quality of the wine, so will make alcohol a constant and apply a few more variables to see if it effects quality.
INSIGHT: The quality of the alcohol does seem be effected by Density.
INSIGHT: It seems like a high the level of Sulphates paired with higher alcohol content produce higher quality.
INSIGHT: It seems like a low the level of Volatile acid paired with higher alcohol content produce higher quality.
INSIGHT: It seems like a low pH level paired with higher alcohol content produce higher quality.
INSIGHT: No correlation between residual sugar and quality keeping alcohol constant.
INSIGHT: Lower Sulphur Dioxide seems to produces better wine
Now we will see the effect of Acids on the Quality.
INSIGHT: Higher Citric Acid and low Volatile Acid seems to produce better Wines
INSIGHT: Nothing here.
INSIGHT: Nothing here.
Now after all these analysis, I am going to take the variables which are most strongly correlated with the quality of the wine and generate a linear model with them.
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = tr_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = tr_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity,
## data = tr_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid, data = tr_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid + fixed.acidity, data = tr_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH,
## data = tr_data)
##
## ====================================================================================================
## m1 m2 m3 m4 m5 m6
## ----------------------------------------------------------------------------------------------------
## (Intercept) 0.155 -0.273 0.866*** 0.973*** 0.497 1.494**
## (0.220) (0.224) (0.247) (0.254) (0.287) (0.515)
## alcohol 0.333*** 0.320*** 0.286*** 0.284*** 0.296*** 0.339***
## (0.021) (0.021) (0.020) (0.020) (0.020) (0.021)
## sulphates 0.855*** 0.599*** 0.650*** 0.667*** 0.733***
## (0.126) (0.124) (0.127) (0.126) (0.129)
## volatile.acidity -1.153*** -1.279*** -1.352***
## (0.124) (0.143) (0.144)
## citric.acid -0.231 -0.629***
## (0.132) (0.174)
## fixed.acidity 0.058***
## (0.017)
## pH -0.569***
## (0.149)
## ----------------------------------------------------------------------------------------------------
## R-squared 0.209 0.245 0.308 0.310 0.319 0.256
## adj. R-squared 0.208 0.243 0.306 0.307 0.315 0.254
## sigma 0.707 0.691 0.662 0.661 0.657 0.686
## F 252.335 155.125 141.769 107.317 89.264 109.700
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1027.549 -1004.996 -963.139 -961.610 -955.575 -997.782
## Deviance 478.652 456.660 418.487 417.154 411.937 449.841
## AIC 2061.098 2017.992 1936.279 1935.219 1925.150 2005.565
## BIC 2075.695 2037.456 1960.608 1964.415 1959.211 2029.894
## N 959 959 959 959 959 959
## ====================================================================================================
investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest? Were there any interesting or surprising interactions between features?
High Alcohol coupled with high Sulaphate seems to produce higher quality. High Alcohol coupled with Citric Acid seems to show a weak positive correlation for higher quality.
and limitations of your model.
Using the data insights gathered, a linear model was created. But the model did suffer from a limitation. Which was that the majority of the data was for average quality wines. Therefore a significant confidence level in the equations was not produced. This is explained by the low R squared value obtained; alcohol only contributes to only 23% of the Quality. A more complete dataset with more ‘good’ and ‘bad’ quality wines is needed.
1- Alcohol and Sulphates seems to be important in producing higher wine quality.
2- The linear model showed us the variation in the error percentage with different qualities of Wine, so plot of error value against the quality should useful.
INSIGHT: It’s easy to see that Alcohol has a positive relationship with quality. The more the alcohol is present, the higher the quality. But the R Squared value shows that alcohol is only responsible for about 23% in the variance of the quality.
INSIGHT: This plot shows that wines with both high values for Alcohol percentage and Sulphates seems to yield higher quality wine.
INSIGHT: Since most of the data is for ‘Average’ quality wines, it makes sense that the error is super dense around that section. m5, the best fit linear model with the highest R squared value could only explain 32% change in quality. So to summarize, predicting both ‘Good’ and ‘Bad’ quality wines using this model is ill advised.
Since most of the data collected was for ‘average’ quality, the training set was not helpful in building a model that can accurately predict the quality of wine in the extreme ranges (‘bad’ and ‘good’ wines). A data set with more infomation on bad and good quality wines is needed.
My univariate analysis, showed that most variables displayed either a Positively skewed or a Normal Distribution. But the distribution of Citric acid looked a bit off, some wines samples had no citric acid present, seems that there is incomplete data present.
My bivariate analysis, plotting variables vs quality, showed that the most influential factors for quality were alcohol, sulphate and acid concentrations. Also a very weird relationship showed up, Volatile acidity had a positive correlation with pH… I thought pH and acidity had a negative relationship!
My multivariate analysis, showed that Alcohol has a positive relationship with quality. The more the alcohol is present, the higher the quality. But the R Squared value shows that alcohol is only responsible for about 23% in the variance of the quality. Density seemed to have no part in wine quality. And that high values for Alcohol percentage and Sulphates seems to yield higher quality wine.
Suggestions for the future: Maybe a more complete dataset should be collected. One with more values in the good and bad ranges. Also, Volatile acidity having a positive correlation with pH is super weird. Maybe an in-depth analysis of this should done.